import numpy as np
import pandas as pd
import requests
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.options.plotting.backend = "plotly"
In the cell below, gather all three pieces of data for this project and load them into the notebook. Note: the method required to gather each piece of data is different.
archive = pd.read_csv('twitter-archive-enhanced.csv')
url = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-'
'predictions.tsv')
response = requests.get(url)
with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)
img_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
Note: The read_csv function can read TSVs straight from a URL.
img_predictions = pd.read_csv(url, sep='\t')
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# The DataFrame below must be the one loaded from twitter_archive_enhanced.csv;
# in this notebook that DataFrame is named `archive`.
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = archive.tweet_id.values
len(tweet_ids)
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)
api_data = pd.read_json('tweet_json.txt', lines=True)
api_data = api_data[['id','retweet_count','favorite_count']]
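For reference, the one-JSON-object-per-line format written by the gathering loop above is exactly what `pd.read_json(..., lines=True)` parses. A minimal round-trip sketch with invented tweet payloads (the file name and all values here are illustrative only, not real API data):

```python
import json
import pandas as pd

# Invented stand-ins for the Twitter API's JSON payloads
fake_tweets = [
    {'id': 1, 'retweet_count': 5, 'favorite_count': 9, 'full_text': 'good dog'},
    {'id': 2, 'retweet_count': 3, 'favorite_count': 7, 'full_text': 'also good'},
]

# Write one JSON object per line, as in the gathering loop above
with open('demo_tweet_json.txt', 'w') as outfile:
    for tweet in fake_tweets:
        json.dump(tweet, outfile)
        outfile.write('\n')

# Read the line-delimited JSON back and keep only the columns of interest
demo = pd.read_json('demo_tweet_json.txt', lines=True)
demo = demo[['id', 'retweet_count', 'favorite_count']]
```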
In this section, detect and document at least eight (8) quality issues and two (2) tidiness issues. You must use both visual and programmatic assessment to assess the data.
Note: pay attention to the following key points as you assess the data.
archive.head()
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> | This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643555336193/photo/1 | 13 | 10 | Phineas | NaN | NaN | NaN | NaN |
| 1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> | This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421306343426/photo/1 | 13 | 10 | Tilly | NaN | NaN | NaN | NaN |
| 2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> | This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181378084864/photo/1 | 12 | 10 | Archie | NaN | NaN | NaN | NaN |
| 3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> | This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557279858688/photo/1 | 13 | 10 | Darla | NaN | NaN | NaN | NaN |
| 4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> | This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 | 12 | 10 | Franklin | NaN | NaN | NaN | NaN |
archive['source'].value_counts()
source
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: count, dtype: int64
archive['text'].head()
0    This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
1    This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV
2    This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB
3    This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
4    This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
Name: text, dtype: object
archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2356 non-null   int64
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object
 4   source                      2356 non-null   object
 5   text                        2356 non-null   object
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object
 9   expanded_urls               2297 non-null   object
 10  rating_numerator            2356 non-null   int64
 11  rating_denominator          2356 non-null   int64
 12  name                        1611 non-null   object
 13  doggo                       97 non-null     object
 14  floofer                     10 non-null     object
 15  pupper                      257 non-null    object
 16  puppo                       30 non-null     object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
numerator_count = archive['rating_numerator'].value_counts().sort_index()
# Transposing the data saves screen space.
numerator_count_df_T = pd.DataFrame(numerator_count).T
numerator_count_df_T.rename(index={'count':'numerator_count'}, inplace=True)
numerator_count_df_T
| rating_numerator | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 17 | 20 | 24 | 26 | 27 | 44 | 45 | 50 | 60 | 75 | 80 | 84 | 88 | 99 | 121 | 143 | 144 | 165 | 182 | 204 | 420 | 666 | 960 | 1776 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| numerator_count | 2 | 9 | 9 | 19 | 17 | 37 | 32 | 55 | 102 | 158 | 461 | 464 | 558 | 351 | 54 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |
archive.query('rating_numerator > 14')[['text','rating_numerator','rating_denominator']].head(10)
| text | rating_numerator | rating_denominator | |
|---|---|---|---|
| 55 | @roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s | 17 | 10 |
| 188 | @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research | 420 | 10 |
| 189 | @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10 | 666 | 10 |
| 285 | RT @KibaDva: I collected all the good dogs!! 15/10 @dog_rates #GoodDogs https://t.co/6UCGFczlOI | 15 | 10 |
| 290 | @markhoppus 182/10 | 182 | 10 |
| 291 | @bragg6of8 @Andy_Pace_ we are still looking for the first 15/10 | 15 | 10 |
| 313 | @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho | 960 | 0 |
| 340 | RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu… | 75 | 10 |
| 433 | The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd | 84 | 70 |
| 516 | Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx | 24 | 7 |
📝 The 84/70 rating is for a group of dogs. This warrants further analysis...
denominator_count = archive['rating_denominator'].value_counts().sort_index()
# Transposing the data saves screen space.
denominator_count_df_T = pd.DataFrame(denominator_count).T
denominator_count_df_T.rename(index={'count':'denominator_count'}, inplace=True)
denominator_count_df_T
| rating_denominator | 0 | 2 | 7 | 10 | 11 | 15 | 16 | 20 | 40 | 50 | 70 | 80 | 90 | 110 | 120 | 130 | 150 | 170 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| denominator_count | 1 | 1 | 1 | 2333 | 3 | 1 | 1 | 2 | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
img_predictions.head()
| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
img_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2075 non-null   int64
 1   jpg_url   2075 non-null   object
 2   img_num   2075 non-null   int64
 3   p1        2075 non-null   object
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool
 6   p2        2075 non-null   object
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool
 9   p3        2075 non-null   object
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
api_data.head()
| id | retweet_count | favorite_count | |
|---|---|---|---|
| 0 | 892420643555336193 | 8853 | 39467 |
| 1 | 892177421306343426 | 6514 | 33819 |
| 2 | 891815181378084864 | 4328 | 25461 |
| 3 | 891689557279858688 | 8964 | 42908 |
| 4 | 891327558926688256 | 9774 | 41048 |
api_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              2354 non-null   int64
 1   retweet_count   2354 non-null   int64
 2   favorite_count  2354 non-null   int64
dtypes: int64(3)
memory usage: 55.3 KB
Includes retweets. Per the retweeted_ columns, 181 of the entries are retweets.
Includes tweets without URLs. There are 2356 entries, but only 2297 of them have expanded_urls.
The source data includes HTML tags.
The text data includes URLs referring back to the tweet itself. These URLs are shortened forms of those found in the expanded_urls column.
Some of the expanded_urls values consist of duplicate URLs strung together, separated by commas.
The rating_numerator column lists the first numerator found in the text of each tweet, whether it is the actual rating or not. The rating_denominator column likewise lists the first denominator found in the text of each tweet, whether it is the rating or not.
The id columns should be object-type, not float or int. And the timestamp column should be datetime.
The tweet data spans multiple tables. api_data holds additional data on the tweets in archive, and img_predictions covers the images in those tweets. As all three tables describe the same set of tweets, they should be merged into a single master table.
The "dog stage" data spans multiple columns: doggo, floofer, pupper, and puppo are values of the "dog stage" variable.
In this section, clean all of the issues you documented while assessing.
Note: Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of tidy data. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
# Make copies of original pieces of data
archive_clean = archive.copy()
api_data_clean = api_data.copy()
img_predict_clean = img_predictions.copy()
### Remove the retweets, i.e. rows with `retweeted_` data.* Then drop the now useless `retweeted_` columns.

*If a tweet has `retweeted_` data, it's a retweet.
archive_clean.query('retweeted_status_id.isna()', inplace=True)
archive_clean.drop(columns=['retweeted_status_id',
'retweeted_status_user_id',
'retweeted_status_timestamp',
], inplace=True)
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   tweet_id               2175 non-null   int64
 1   in_reply_to_status_id  78 non-null     float64
 2   in_reply_to_user_id    78 non-null     float64
 3   timestamp              2175 non-null   object
 4   source                 2175 non-null   object
 5   text                   2175 non-null   object
 6   expanded_urls          2117 non-null   object
 7   rating_numerator       2175 non-null   int64
 8   rating_denominator     2175 non-null   int64
 9   name                   1495 non-null   object
 10  doggo                  87 non-null     object
 11  floofer                10 non-null     object
 12  pupper                 234 non-null    object
 13  puppo                  25 non-null     object
dtypes: float64(2), int64(3), object(9)
memory usage: 254.9+ KB
### Drop the tweets without expanded_urls.

archive_clean.dropna(subset='expanded_urls', inplace=True)
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2117 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   tweet_id               2117 non-null   int64
 1   in_reply_to_status_id  23 non-null     float64
 2   in_reply_to_user_id    23 non-null     float64
 3   timestamp              2117 non-null   object
 4   source                 2117 non-null   object
 5   text                   2117 non-null   object
 6   expanded_urls          2117 non-null   object
 7   rating_numerator       2117 non-null   int64
 8   rating_denominator     2117 non-null   int64
 9   name                   1495 non-null   object
 10  doggo                  84 non-null     object
 11  floofer                10 non-null     object
 12  pupper                 232 non-null    object
 13  puppo                  24 non-null     object
dtypes: float64(2), int64(3), object(9)
memory usage: 248.1+ KB
# The key column is named `tweet_id` in two of the tables but `id` in the API data.
api_data_clean.rename(columns={'id':'tweet_id'}, inplace=True)
archive_clean = pd.merge(api_data_clean, archive_clean,
how='inner', on='tweet_id')
archive_clean = pd.merge(archive_clean, img_predict_clean,
how='inner', on='tweet_id')
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   tweet_id               1994 non-null   int64
 1   retweet_count          1994 non-null   int64
 2   favorite_count         1994 non-null   int64
 3   in_reply_to_status_id  23 non-null     float64
 4   in_reply_to_user_id    23 non-null     float64
 5   timestamp              1994 non-null   object
 6   source                 1994 non-null   object
 7   text                   1994 non-null   object
 8   expanded_urls          1994 non-null   object
 9   rating_numerator       1994 non-null   int64
 10  rating_denominator     1994 non-null   int64
 11  name                   1448 non-null   object
 12  doggo                  74 non-null     object
 13  floofer                8 non-null      object
 14  pupper                 212 non-null    object
 15  puppo                  23 non-null     object
 16  jpg_url                1994 non-null   object
 17  img_num                1994 non-null   int64
 18  p1                     1994 non-null   object
 19  p1_conf                1994 non-null   float64
 20  p1_dog                 1994 non-null   bool
 21  p2                     1994 non-null   object
 22  p2_conf                1994 non-null   float64
 23  p2_dog                 1994 non-null   bool
 24  p3                     1994 non-null   object
 25  p3_conf                1994 non-null   float64
 26  p3_dog                 1994 non-null   bool
dtypes: bool(3), float64(5), int64(6), object(13)
memory usage: 379.8+ KB
### doggo, floofer, pupper, and puppo are values of the "dog stage" variable.

Most of the values are null. Melt the four columns into a single dog_stage column, keeping only the non-null values, then drop the four "dog stage" columns.

dog_stage_cols = ['doggo', 'pupper', 'floofer', 'puppo']
# Update: pandas now reads these entries as NaN rather than the string "None",
# so restore the "None" strings before joining
archive_clean[dog_stage_cols] = archive_clean[dog_stage_cols].fillna('None')
f = lambda x: ', '.join(x[x != 'None'])
archive_clean['dog_stage'] = archive_clean[dog_stage_cols].apply(f, axis = 1)
archive_clean.drop(columns=dog_stage_cols, inplace=True)
archive_clean['dog_stage'].loc[10:14]
10
11
12    puppo
13
14    puppo
Name: dog_stage, dtype: object
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   tweet_id               1994 non-null   int64
 1   retweet_count          1994 non-null   int64
 2   favorite_count         1994 non-null   int64
 3   in_reply_to_status_id  23 non-null     float64
 4   in_reply_to_user_id    23 non-null     float64
 5   timestamp              1994 non-null   object
 6   source                 1994 non-null   object
 7   text                   1994 non-null   object
 8   expanded_urls          1994 non-null   object
 9   rating_numerator       1994 non-null   int64
 10  rating_denominator     1994 non-null   int64
 11  name                   1448 non-null   object
 12  jpg_url                1994 non-null   object
 13  img_num                1994 non-null   int64
 14  p1                     1994 non-null   object
 15  p1_conf                1994 non-null   float64
 16  p1_dog                 1994 non-null   bool
 17  p2                     1994 non-null   object
 18  p2_conf                1994 non-null   float64
 19  p2_dog                 1994 non-null   bool
 20  p3                     1994 non-null   object
 21  p3_conf                1994 non-null   float64
 22  p3_dog                 1994 non-null   bool
 23  dog_stage              1994 non-null   object
dtypes: bool(3), float64(5), int64(6), object(10)
memory usage: 333.1+ KB
archive_clean['dog_stage'].unique()
array(['', 'doggo', 'puppo', 'pupper', 'floofer', 'doggo, puppo',
'doggo, floofer', 'doggo, pupper'], dtype=object)
archive_clean['dog_stage'] = archive_clean['dog_stage'].replace('', None)
archive_clean['dog_stage'].loc[10:14]
10     None
11     None
12    puppo
13     None
14    puppo
Name: dog_stage, dtype: object
archive_clean['dog_stage'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 1994 entries, 0 to 1993
Series name: dog_stage
Non-Null Count  Dtype
--------------  -----
306 non-null    object
dtypes: object(1)
memory usage: 15.7+ KB
### The source data includes HTML tags.

archive_clean['source'] = archive_clean['source'].str.extract(r'(>.+<)')
archive_clean['source'] = archive_clean['source'].str.strip('><')
archive_clean['source'].unique()
array(['Twitter for iPhone', 'Twitter Web Client', 'TweetDeck'],
dtype=object)
### The text data includes URLs referring back to the tweet itself.

regex = r' https:\/\/t\.co\/[a-zA-Z0-9]+'
archive_clean['text'] = archive_clean['text'].str.replace(regex, '', regex=True)
archive_clean['text'].str.contains('https://t.co').sum()
26
archive_clean.query('text.str.contains("https://t.co")')['text'].head()
6      Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H
26     This is Mingus. He's a wonderful father to his smol pup. Confirmed 13/10, but he needs your help\n\nhttps://t.co/bVi0Yr4Cff
63     Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps://t.co/245xJJElsY
80     This is Sierra. She's one precious pupper. Absolute 12/10. Been in and out of ICU her whole life. Help Sierra below\n\nhttps://t.co/Xp01EU3qyD
102    Say hello to Cooper. His expression is the same wet or dry. Absolute 12/10 but Coop desperately requests your help\n\nhttps://t.co/ZMTE4Mr69f
Name: text, dtype: object
✅ These URLs—separated by newlines—point to fundraiser pages: they are part of the tweet.
### Some of the expanded_urls values consist of duplicate URLs strung together, separated by commas.

archive_clean['expanded_urls'] = archive_clean['expanded_urls'].str.split(',')
f = lambda x: [*set(x)]
archive_clean['expanded_urls'] = archive_clean['expanded_urls'].map(f)
multi_url = [urls for urls in archive_clean['expanded_urls'] if len(urls) > 1]
multi_url[:5]
[['https://twitter.com/dog_rates/status/890971913173991426/photo/1',
  'https://gofundme.com/ydvmve-surgery-for-jax'],
 ['https://twitter.com/dog_rates/status/886736880519319552/photo/1',
  'https://www.gofundme.com/mingusneedsus'],
 ['https://twitter.com/dog_rates/status/878281511006478336/photo/1',
  'https://www.gofundme.com/3yd6y1c'],
 ['https://www.gofundme.com/help-my-baby-sierra-get-better',
  'https://twitter.com/dog_rates/status/873213775632977920/photo/1'],
 ['https://twitter.com/dog_rates/status/870656317836468226/photo/1',
  'https://www.gofundme.com/help-fix-codys-torn-acl']]
⚠️ While not duplicates per se, these fundraiser URLs are somewhat redundant with their shortened versions in the text data (above).
### The rating_numerator column lists the first numerator found in the text of each tweet, whether it's the actual rating or not.

Replace the rating_numerator values with complete rating numerators, decimals and all, extracted from the text data.

# Find all sequences resembling a numerator and select the last one. The ratings are at
# the end.
archive_clean['rating_numerator'] = archive_clean['text'].str.findall(r'(\d*\.?\d+\/)').str[-1]
archive_clean['rating_numerator'] = archive_clean['rating_numerator'].str.rstrip('/')
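As a quick sanity check, the same pattern can be exercised on plain strings with Python's re module. The sample texts are taken from the tables above; this is only an illustration of the take-the-last-match logic, not part of the cleaning pipeline.

```python
import re

# Same pattern as above: optional integer part, optional decimal point,
# digits, then the rating slash
pattern = r'(\d*\.?\d+\/)'

samples = [
    "H*ckin magical af 9.75/10",
    "but 960/00 isn't a valid rating, 13/10 is tho",
]

# Keep the last match in each text and strip the trailing slash
for text in samples:
    print(re.findall(pattern, text)[-1].rstrip('/'))  # prints 9.75, then 13
```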
archive_clean.query(r'text.str.contains("\d+\.\d+\/")')[['text','rating_numerator']]
| text | rating_numerator | |
|---|---|---|
| 39 | This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 | 13.5 |
| 503 | This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 | 9.75 |
| 553 | This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back | 11.27 |
| 1374 | Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 | 11.26 |
archive_clean[['text','rating_numerator']]
| text | rating_numerator | |
|---|---|---|
| 0 | This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 | 13 |
| 1 | This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 | 13 |
| 2 | This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 | 12 |
| 3 | This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us | 13 |
| 4 | This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek | 12 |
| ... | ... | ... |
| 1989 | Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 | 5 |
| 1990 | This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 | 6 |
| 1991 | Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af | 9 |
| 1992 | This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of | 7 |
| 1993 | Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet | 8 |
1994 rows × 2 columns
archive_clean['rating_numerator'].unique()
array(['13', '12', '14', '13.5', '11', '6', '10', '0', '84', '24', '9.75',
'5', '11.27', '3', '.10', '7', '8', '9', '4', '165', '1776', '.11',
'.9', '204', '99', '80', '.12', '45', '60', '2', '44', '.8', '143',
'121', '11.26', '144', '88', '1', '420'], dtype=object)
archive_clean.query('rating_numerator == ".10"')[['text','rating_numerator']].head()
| text | rating_numerator | |
|---|---|---|
| 586 | This is Finn. He's very nervous for the game. Has a lot of money riding on it.10/10 would attempt to comfort | .10 |
| 734 | What jokester sent in a pic without a dog in it? This is not @rock_rates. This is @dog_rates. Thank you ...10/10 | .10 |
| 751 | Guys pls stop sending actual sharks. It's too dangerous for me and the people taking the photos. Thank you ...10/10 | .10 |
| 803 | This is getting incredibly frustrating. This is a Mexican Golden Beaver. We only rate dogs. Only send dogs ...10/10 | .10 |
| 808 | This... is a Tyrannosaurus rex. We only rate dogs. Please only send in dogs. Thank you ...10/10 | .10 |
### Strip the stray leading periods from the rating_numerator values.

archive_clean['rating_numerator'] = archive_clean['rating_numerator'].str.lstrip('.')
archive_clean['rating_numerator'].unique()
array(['13', '12', '14', '13.5', '11', '6', '10', '0', '84', '24', '9.75',
'5', '11.27', '3', '7', '8', '9', '4', '165', '1776', '204', '99',
'80', '45', '60', '2', '44', '143', '121', '11.26', '144', '88',
'1', '420'], dtype=object)
### The rating_denominator column lists the first denominator encountered in the text of each tweet.

Conversely, the actual rating denominators come at the end, so extract them from the text starting from the end.

# The trailing `[\s.,!]+` verifies that the pattern isn't part of a URL. And the `\/\d+$`
# matches ratings at the very end of the text.
archive_clean['rating_denominator'] = archive_clean['text'].str.findall(r'(\/\d+[\s.,!]+|\/\d+$)').str[-1]
archive_clean['rating_denominator'] = archive_clean['rating_denominator'].str.strip('/.')
archive_clean.query('rating_denominator.isna()')[['text','rating_numerator']].head()
| text | rating_numerator | |
|---|---|---|
| 723 | Meet Jax & Jil. Jil is yelling the pledge of allegiance. If u cant take the freedom get out the kitchen Jax. 10/10s | 10 |
| 855 | "Challenge completed" \n(pupgraded to 12/10) | 12 |
| 1304 | Gang of fearless hoofed puppers here. Straight savages. Elevated for extra terror. Front one has killed before 6/10s | 6 |
| 1332 | Meet Joey and Izzy. Joey only has one ear that works and Izzy wants 2015 to be over already. Both great pups. 11/10s | 11 |
| 1496 | 🎶 HELLO FROM THE OTHER SIIIIIIIIDE 🎶 10/10s | 10 |
archive_clean['rating_denominator'].unique()
array(['10', '10 ', '10, ', '10. ', '10\n\n', '10 ', '10 \n', '70', '7 ',
'150', nan, '170 ', '10\n', '90 ', '80', '10! ', '50 ', '40 ',
'10 \n\n', '130. ', '110', '120 ', '80 '], dtype=object)
archive_clean['rating_denominator'] = archive_clean['text'].str.findall(r'(\/\d+[\s.,!)s]+|\/\d+$)').str[-1]
archive_clean['rating_denominator'] = archive_clean['rating_denominator'].str.strip('/.,!)s \n')
archive_clean['rating_denominator'].unique()
array(['10', '70', '7', '150', '170', '90', '80', '50', '40', '130',
'110', '120'], dtype=object)
archive_clean.query('rating_denominator != "10"')[['text','rating_denominator']].head()
| text | rating_denominator | |
|---|---|---|
| 323 | The floofs have been released I repeat the floofs have been released. 84/70 | 70 |
| 385 | Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t | 7 |
| 662 | Why does this never happen at my front door... 165/150 | 150 |
| 848 | Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once | 170 |
| 946 | Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody | 90 |
### The id columns should be object-type, not float or int. And the timestamp column should be datetime.

Recast the id columns as object-type and the timestamp column as datetime.

id_col_type = {'tweet_id': object,
'in_reply_to_status_id': object,
'in_reply_to_user_id': object,
}
archive_clean = archive_clean.astype(id_col_type)
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])
archive_clean.dtypes
tweet_id                              object
retweet_count                          int64
favorite_count                         int64
in_reply_to_status_id                 object
in_reply_to_user_id                   object
timestamp                datetime64[ns, UTC]
source                                object
text                                  object
expanded_urls                         object
rating_numerator                      object
rating_denominator                    object
name                                  object
jpg_url                               object
img_num                                int64
p1                                    object
p1_conf                              float64
p1_dog                                  bool
p2                                    object
p2_conf                              float64
p2_dog                                  bool
p3                                    object
p3_conf                              float64
p3_dog                                  bool
dog_stage                             object
dtype: object
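A note on why the ids are recast to object rather than left numeric: tweet ids are 18-19 digit integers, and float64 carries only 53 bits of precision, so any trip through float (for example via a column that also holds NaNs) silently corrupts the low digits. A quick illustration using the first tweet id from the archive:

```python
# First tweet id from the archive, as an exact Python int
tweet_id = 892420643555336193

# float64 cannot represent every 18-19 digit integer; at this magnitude
# representable floats are 128 apart, so the low digits get rounded away
as_float = float(tweet_id)

print(int(as_float))              # not the original id
print(int(as_float) == tweet_id)  # False
```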
### The rating_ columns were overwritten with number strings.

Recast the rating_ columns as float-type.

col_type_plus = {'rating_numerator': float,
'rating_denominator': float,
}
archive_clean = archive_clean.astype(col_type_plus)
archive_clean[['rating_numerator', 'rating_denominator']].dtypes
rating_numerator      float64
rating_denominator    float64
dtype: object
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
archive_clean.to_csv('twitter_archive_master.csv', index=False)
df = pd.read_csv('twitter_archive_master.csv', dtype=id_col_type, parse_dates=['timestamp'])
In this section, analyze and visualize your wrangled data. You must produce at least three (3) insights and one (1) visualization.
rating_cols = ['rating_numerator','rating_denominator','text','jpg_url']
df[rating_cols].query('rating_denominator > 10')
| rating_numerator | rating_denominator | text | jpg_url | |
|---|---|---|---|---|
| 323 | 84.0 | 70.0 | The floofs have been released I repeat the floofs have been released. 84/70 | https://pbs.twimg.com/media/C2OtWr0VQAEnS9r.jpg |
| 662 | 165.0 | 150.0 | Why does this never happen at my front door... 165/150 | https://pbs.twimg.com/ext_tw_video_thumb/758467147756691456/pu/img/YTNzjRFDSPNXukmM.jpg |
| 848 | 204.0 | 170.0 | Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once | https://pbs.twimg.com/media/CiWWhVNUYAAab_r.jpg |
| 946 | 99.0 | 90.0 | Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody | https://pbs.twimg.com/media/CehIzzZWQAEyHH5.jpg |
| 970 | 80.0 | 80.0 | Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 | https://pbs.twimg.com/media/CdzETn4W4AAVU5N.jpg |
| 988 | 45.0 | 50.0 | From left to right:\nCletus, Jerome, Alejandro, Burp, & Titson\nNone know where camera is. 45/50 would hug all at once | https://pbs.twimg.com/media/CdeUKpcWoAAJAWJ.jpg |
| 1054 | 60.0 | 50.0 | Here is a whole flock of puppers. 60/50 I'll take the lot | https://pbs.twimg.com/media/CcVOJEcXEAM0FHL.jpg |
| 1130 | 44.0 | 40.0 | Happy Wednesday here's a bucket of pups. 44/40 would pet all at once | https://pbs.twimg.com/media/Ca3i7CzXIAMLhg8.jpg |
| 1302 | 143.0 | 130.0 | Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you | https://pbs.twimg.com/media/CX7br3HWsAAQ9L1.jpg |
| 1303 | 121.0 | 110.0 | Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 | https://pbs.twimg.com/media/CX7Y_ByWwAEJdUy.jpg |
| 1435 | 144.0 | 120.0 | IT'S PUPPERGEDDON. Total of 144/120 ...I think | https://pbs.twimg.com/media/CWe7kw9W4AE8UJh.jpg |
| 1494 | 88.0 | 80.0 | Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once | https://pbs.twimg.com/media/CWEcxqWVEAAHyGH.jpg |
rating_denominator is $10 \times n$, where $n$ is the number of dogs.
# Normalization of ratings
df['normalized_rating'] = df['rating_numerator'] / df['rating_denominator'] * 10
df['normalized_rating'].describe()
count    1994.000000
mean       11.646472
std        40.643919
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: normalized_rating, dtype: float64
q1, q3 = df['normalized_rating'].quantile([0.25, 0.75])
iqr = q3 - q1
lower_limit = q1 - 1.5*iqr
upper_limit = q3 + 1.5*iqr
df_sans_outlier = df.query('normalized_rating >= @lower_limit & normalized_rating <= @upper_limit')
df_sans_outlier['normalized_rating'].describe()
count    1875.000000
mean       10.930016
std         1.562598
min         7.000000
25%        10.000000
50%        11.000000
75%        12.000000
max        14.000000
Name: normalized_rating, dtype: float64
# Pre-normalization
df['rating_numerator'].mean()
12.215035105315946
df_outliers = df.query('normalized_rating < @lower_limit | normalized_rating > @upper_limit')
df_outliers[rating_cols].query('rating_numerator > @upper_limit')
| | rating_numerator | rating_denominator | text | jpg_url |
|---|---|---|---|---|
| 385 | 24.0 | 7.0 | Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t | https://pbs.twimg.com/media/C0EyPZbXAAAceSc.jpg |
| 726 | 1776.0 | 10.0 | This is Atticus. He's quite simply America af. 1776/10 | https://pbs.twimg.com/media/CmgBZ7kWcAAlzFD.jpg |
| 1718 | 420.0 | 10.0 | After so many requests... here you go.\n\nGood dogg. 420/10 | https://pbs.twimg.com/media/CU9P717W4AAOlKx.jpg |
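Rows like Sam's show how the archive's original extraction can latch onto the wrong fraction ("smiles 24/7" is not a rating). A minimal sketch of a more robust re-extraction, taking the *last* `numerator/denominator` pair in each tweet; the example texts are hypothetical stand-ins, and this is an illustration of the idea, not the project's actual cleaning step:

```python
import pandas as pd

# Hypothetical tweet texts mirroring the mis-extracted outliers above.
texts = pd.Series([
    "Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer. 13/10",
    "This is Atticus. He's quite simply America af. 1776/10",
])

# Taking the last numerator/denominator match avoids false positives
# such as "24/7" appearing earlier in the text.
pattern = r'(\d+(?:\.\d+)?)/(\d+)'
last_match = texts.str.findall(pattern).str[-1]
ratings = last_match.apply(lambda m: (float(m[0]), int(m[1])))
```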
conf_cols = ['p1_conf','p2_conf','p3_conf']
df_sans_outlier[conf_cols].describe()
| | p1_conf | p2_conf | p3_conf |
|---|---|---|---|
| count | 1875.000000 | 1.875000e+03 | 1.875000e+03 |
| mean | 0.598726 | 1.339234e-01 | 5.989640e-02 |
| std | 0.271115 | 1.004777e-01 | 5.096227e-02 |
| min | 0.044333 | 1.011300e-08 | 1.740170e-10 |
| 25% | 0.367655 | 5.352315e-02 | 1.600860e-02 |
| 50% | 0.599076 | 1.175080e-01 | 4.899880e-02 |
| 75% | 0.848720 | 1.939745e-01 | 9.153815e-02 |
| max | 1.000000 | 4.880140e-01 | 2.734190e-01 |
# If we are to analyze the tweets on a per-breed basis, we need to be reasonably sure that
# a Pug is—in fact—a Pug.
by_breed = df_sans_outlier.query('p1_conf >= 0.80 & p1_dog == True').groupby('p1')
num_dogs = by_breed['normalized_rating'].count()
num_dogs.rename('number_of_dogs', inplace=True)
p1
Afghan_hound 1
Airedale 3
American_Staffordshire_terrier 5
Bernese_mountain_dog 4
Blenheim_spaniel 6
..
schipperke 3
soft-coated_wheaten_terrier 2
toy_poodle 11
vizsla 5
whippet 1
Name: number_of_dogs, Length: 74, dtype: int64
mean_rating = by_breed['normalized_rating'].mean()
mean_rating.rename('mean_rating', inplace=True)
p1
Afghan_hound 13.000000
Airedale 11.000000
American_Staffordshire_terrier 11.200000
Bernese_mountain_dog 11.250000
Blenheim_spaniel 11.166667
...
schipperke 11.000000
soft-coated_wheaten_terrier 11.500000
toy_poodle 11.363636
vizsla 10.800000
whippet 9.000000
Name: mean_rating, Length: 74, dtype: float64
df_by_breed = pd.DataFrame([num_dogs, mean_rating]).T
df_by_breed.sort_values('mean_rating')
| p1 | number_of_dogs | mean_rating |
|---|---|---|
| whippet | 1.0 | 9.0 |
| English_setter | 1.0 | 9.0 |
| Italian_greyhound | 3.0 | 9.0 |
| basenji | 2.0 | 9.5 |
| keeshond | 3.0 | 10.0 |
| ... | ... | ... |
| briard | 1.0 | 13.0 |
| Leonberg | 1.0 | 13.0 |
| Afghan_hound | 1.0 | 13.0 |
| black-and-tan_coonhound | 1.0 | 14.0 |
| Gordon_setter | 1.0 | 14.0 |
74 rows × 2 columns
# Selecting the `mean_rating` column prints a plain Series, which is easier to
# scan: the monospaced values line up for quick up-and-down comparisons.
df_by_breed.groupby('number_of_dogs')['mean_rating'].mean().sort_index()
number_of_dogs
1.0     11.403889
2.0     10.708333
3.0     10.857143
4.0     10.714286
5.0     10.828571
6.0     10.666667
7.0     11.285714
8.0     11.125000
9.0     10.888889
11.0    11.363636
14.0    11.357143
16.0    11.875000
18.0    10.500000
21.0    11.190476
22.0    11.909091
30.0    10.600000
33.0    11.515152
38.0    11.973684
63.0    11.880952
Name: mean_rating, dtype: float64
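Breeds seen only once dominate both extremes of the ranking, so their means are noisy. As a sanity check on the overall level, a count-weighted mean discounts those one-off breeds; this is a sketch using a tiny hypothetical stand-in for `df_by_breed`:

```python
import pandas as pd

# Hypothetical stand-in for df_by_breed (index = predicted breed).
df_by_breed = pd.DataFrame(
    {'number_of_dogs': [1, 3, 63], 'mean_rating': [14.0, 11.0, 11.88]},
    index=['Gordon_setter', 'Airedale', 'golden_retriever'])

# Weight each breed's mean by how many dogs contributed to it.
weighted = ((df_by_breed['mean_rating'] * df_by_breed['number_of_dogs']).sum()
            / df_by_breed['number_of_dogs'].sum())
```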
# Note: newer pandas versions no longer drop non-numeric columns automatically
# in groupby aggregations, so select the numeric columns explicitly.
by_breed_mean = by_breed[['normalized_rating', 'favorite_count']].mean()
fig = by_breed_mean.plot.scatter(x='normalized_rating',
y='favorite_count',
labels={'normalized_rating':'Normalized Rating',
'favorite_count':'Favorite Count',
},
title='Breed Rating vs Popular Preference',
hover_name=by_breed_mean.index,
trendline='ols')
fig.update_layout(title={'x':0.5,
'xanchor':'center',
'y':0.85,
})
fig.show()
# Finding the correlation coefficient (r) using the R-squared—sourced from
# the OLS trendline tooltip (above).
np.sqrt(0.357845)
0.5982014710780976
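Rather than reading R² off the trendline tooltip, the Pearson r can be computed directly from the per-breed means; squaring it recovers the trendline's R². A minimal sketch with synthetic data standing in for `by_breed_mean` (the column names match the notebook, the values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for by_breed_mean: per-breed mean rating vs mean favorites.
rng = np.random.default_rng(0)
rating = rng.uniform(9, 14, size=74)
favorites = 2000 * rating + rng.normal(0, 3000, size=74)
by_breed_mean = pd.DataFrame({'normalized_rating': rating,
                              'favorite_count': favorites})

# Pearson r straight from the data -- no tooltip required.
r = by_breed_mean['normalized_rating'].corr(by_breed_mean['favorite_count'])
# Squaring r recovers the R^2 that the OLS trendline reports.
r_squared = r ** 2
```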
1. The `rating_denominator` is $10 \times n$, where $n$ is the number of dogs.
2. Normalized and stripped of outliers, the mean rating is 10.9 out of 10.
3. The neural network is not very confident in its dog breed predictions.
4. Being a rarely seen breed appears to give a ratings boost.
5. The ratings correlate fairly well with popular preference (r ≈ 0.60).